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Abstract 



This paper presents an application of a stochastic approximation EM-algorithm using 
a Metropolis-Hastings sampler to estimate the parameters of an item response latent 
regression model. Latent regression models are extensions of item response theory (IRT) to 
a 2-level latent variable model in which covariates serve as predictors of the conditional 
distribution of ability. Applications for estimating latent regression models for data from 
the 2000 National Assessment of Educational Progress (NAEP) grade 4 math assessment 
and the 2002 grade 8 reading assessment are presented and results of the proposed method 
are compared to results obtained using current operational procedures. 
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1 Introduction 



Item response theory (IRT; Lord & Novick, 1968) is the method of choice for assessment 
designs involving multiple test forms that require linking in order to compare individuals 
or groups based on the same variables. Standard examples of such assessment designs are 
educational survey assessments such as the National Assessment of Educational Progress 
(NAEP), the Programme for International Student Assessment (PISA), and the Trends in 
International Mathematics and Science Study (TIMSS), and the Progress in International 
Reading Literacy Study (PIRLS), to name a few. In these assessments, every examinee 
responds to one out of many test booklets addressing the same content domain(s) with 
different, but overlapping subsets of items. 

The testing time in these assessments is short, since the results obtained are 
inconsequential for the participants. Therefore, in order to cover the prohciency domain 
sufficiently, multiple booklets are used, each of which contains a small subset of the total 
number of items. This design feature allows several hundreds of items to be covered, while 
each student is tested for only about 60-90 minutes. 

Several extensions of IRT have been developed for supplementing the relatively sparse 
information obtained from the short booklets used in educational survey assessments. 
Many of these developments follow original ideas by Mislevy (1984). von Davier, Sinharay, 
Oranje, and Beaton (2006) described recent developments and operational procedures 
used in NAEP. The major extension to IRT used in these methods is based on the 
need for estimating unbiased group-level distributions of prohciency for policy-relevant 
subpopulations (such as gender groups) and groups dehned by self-reported ethnicity. 
These extensions of IRT make use of the fact that a large number of background variables 
(such as gender, ethnicity, and socio-economic variables) are collected as covariates of the 
prohciency variables during the survey. Examinees are asked to respond to a questionnaire 
section containing questions on these and other variables, such as attitudes toward content 
domains like physics, mathematics, or reading (depending on what is assessed in the test), 
as well as study skills, and other covariates. These covariates are used in a latent regression 
model that extends IRT to a two-level latent variable model in which the background 
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variables serve as predictors of the conditional distribution of ability. 

In addition to the presence of a (potentially large) number of covariates, survey 
assessments often contain multiple subscales of the content domain of interest. These 
subscales are designed in such a way that each item in the survey is assumed to measure 
only one scale, so that the set of items measuring the k content (subdomains) can be 
treated as a /c-scale simple-structure multidimensional IRT (MIRT) model. The correlations 
between subscales are included in the model by means of extending the latent regression to a 
multivariate, multiple latent regression, in which the conditional mean vector of prohciencies 
(subscales) is predicted by the set of covariates, while the conditional covariance matrix of 
prohciencies is assumed to be the same for all subpopulations (see Thomas, 2002, and von 
Davier et ah, 2006, for models relaxing these assumptions). 

This multivariate multiple latent regression model requires the evaluation or 
approximation of multidimensional integrals, a task that in past decades required days on 
working on computers if large numbers of covariates were used. Additionally, the dimension 
is /c > 3. Research has attempted to reduce the computational burden either by using 
approximations (Thomas, 1993; von Davier, 2003) or by using stochastic methods (Adams, 
Wilson, & Wu, 1997; von Davier & Sinharay, 2007). 

Neither of these approaches is completely satisfactory. They either rely on strong 
assumptions about the shape of the functions involved, or they use stochastic methods that 
do not rely on strong assumptions and do not deliver the promised increase in speed (at 
least when using simulation sample sizes to provide sufficient accuracy). This paper explores 
an alternative to the methods used so far for multivariate latent regressions and presents an 
application of a stochastic approximation EM-algorithm using a Metropolis-Hastings (MH) 
sampler (Cai, 2007; Delyon, Lavielle, & Moulines, 1999; Gu & Kong, 1998; Gu & Zhu, 2001) 
to estimate the parameters of an item response latent regression model. Applications for 
estimating latent regression models for NAEP data from the 2000 grade 4 math assessment 
and the 2002 grade 8 reading assessment are presented, and results of the proposed method 
are compared to results obtained using current operational procedures. 
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2 Item Response Latent Regression Models 

Assume Yi^, ... ,Yj categorical observed random variables and Xi, ... , Xk real valued 
random variables, with realizations on these variables sampled from N examinees. Let 
(y^, x„) = , y^i, , x.^k) denote the realizations for examinee v. Let (6^1 , . . . ,9d) 

be a d-dimensional real number representing the ability variable, and let 

I 

i=l 

be a model describing the probability of the observed categorical variables given the 
d-dimensional parameter 9, referred to as proficiency variable in the subsequent text. For 
these product terms, assume that 

Pfiy\9)=P{y\9-Q) 



with some vector-valued parameter Q. Examples are unidimensional item response models, 
such as 



Pfiy\9^,...,9,) = 



eyrpiiiafij — b, 






1 + exp{zai9j - bifij '' 



where the probability depends on one component, 9j. Similar models suitable for generating 
multinomial probabilities for (mj -|- l)-categorical variables can also be considered. 

For the prohciency variable 9 = {9i, . . . ,9d), assume that 



0„~iv(x,r,s), 

with d X d- variance-covariance matrix S and iL-dimensional regression parameter F. Let 
0(0; I x^r, S) denote the multivariate normal density with mean x^F and covariance matrix 
S. Then, the likelihood of an observation (y„,x„) expressed as a function of F, and S 
can be written as 

L(C, F, S; y„, x,) = /(H I 0; C))0(d | x,F, S)dd, 
and, for the log-likelihood function involving all N examinees, we have 

N I 

logL(C,F,S;Y,X) = inte{\[P{y.^ \ d;C))0(d | x„F, S)dd. (1) 

^!=l i=l 
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This log-likelihood function is the objective function to be maximized with respect to 
parameters S, F, and (. 

Current Operational Estimation Procedures 

In NAEP operational analyses, the item parameters C, associated with the P{yi | C) 
are often determined in a separate estimation phase prior to the estimation of the latent 
regression 0(0 | x„r,S). In the second phase, the latent regression parameters S and F 
are estimated using the previously determined item estimates of C,. These item parameters 
are plugged into the second phase as hxed and known constants, von Davier et ah (2006) 
described this procedure, which is the current method of choice in analysis of many national 
and international large-scale survey assessments, including NAEP, TIMSS, PIRLS, and 
PISA. 

The evaluation of the likelihood function in (1) requires the calculation or approximation 
of multidimensional integrals over the range of 9. The evaluation of these integrals is quite 
time-consuming for integrals with dimension d > 5. Even if adaptive numerical integration 
methods are used, the time required per integral increases with the number of dimensions, 
and the integration has to be repeated N times in each step of the estimation. 

Mislevy, Beaton, Kaplan, and Sheehan (1992) suggested use of the EM-algorithm 
(Dempster, Laird, & Rubin, 1977) for maximization of L in (1) with respect to Fand S. 
The EM-algorithm treats estimation problems with latent variables as an incomplete data 
problem and completes the required statistics in one phase (the E-step, where E stands for 
expectation) , and then maximizes the resulting complete-data likelihood with respect to the 
parameters of interest (the M-step, where M stands for maximization) . In the case of the 
latent regression, the EM-algorithm operates as follows: 

E-step: The integral over the unobserved latent variable 9 is evaluated in the (t)-th E-step 
using the provisional estimates and F*^*b As in the previous section, let 
PiVvi I Ci) denote the probability of response given ability 9 and item 
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parameters Q. Then we have 



/(n^(^-l^;O)0(^|x.rw,sW)d0. 

i=i 

Let the posterior mean of 6 be denoted by Ov^ = -E(s.r,c)(*) I yv,^v)- Mislevy et ah 
(1992) argued that the posterior mean 9^ can subsequently be plugged into the 
ordinary least squares (OLS) estimation equations. Note that the calculation of the 
9^^'^ also requires the evaluation or approximation of an integral, since calculation of 
the expectation 

^(s,r,c)w(0 I y.,x.) = [ 9{l[P{y,, \ 0;C))0(^ | 0(x,r«,sW)d0 

i=i 

is involved. 

M-step: Following the arguments put forward by Mislevy et al. (1992), the maximization 
of the likelihood for step {t + 1) based on provisional estimates from the preceding 
E-step is accomplished by using OLS estimation equations with the posterior means 
generated in the E-step, as outlined in the previous paragraphs. This yields 

An estimate of the conditional variance-covariance matrix is given by 

= V{9 - xr(*+i)) + E(E(s,r,c)w(^^ I y.,x„)). 

The computationally intensive calculations involved here are the determination of estimates 
or approximations of p ,^)(t)(0 | y„,x^,) and p ,jyt) (^1 | yt,,Xt,) for each of the N 
observations; both calculations involve the numerical evaluation or the approximation 
of (multidimensional) integrals. If the dimension d is large (d > 4 in this context), this 
evaluation can be rather time-consuming. 

3 Metropolis-Hastings Stochastic Approximation 

Maximum-likelihood estimation for likelihood functions that involve a large number 
of random effects or latent variables can be very time consuming due to the need to 
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evaluate integrals in multiple dimensions. Researchers have examined stochastic versions 
of numerical integration methods as potential alternatives that allow estimation of 
high-dimensional models in shorter time. Von Davier and Sinharay (2007) employed 
importance sampling in a stochastic EM-algorithm for estimating the parameters of a latent 
regression with a multidimensional IRT measurement model. Adams et ah (1997) used 
stochastic methods to evaluate integrals for the estimation of multidimensional generalized 
Rasch models. Gu and Kong (1998) demonstrated the use of stochastic approximation 
with a Markov-chain generated by an MH sampler, as did Delyon et al. (1999) for the 
EM-algorithm. Gu and Zhu (2001) as well as Kuhn and Lavielle (2004) showed how 
to combine Markov chain Monte Garlo (MGMG) and stochastic approximation. To our 
knowledge, stochastic approximation has never been applied to large-scale data analysis 
with latent regression models involving hundreds of parameters estimated based on samples 
composed of some 10,000 to 100,000 observations. The rationale behind the proposed MH 
stochastic approximation makes this an area where this method is likely to prove useful. 

Stochastic Approximation for Optimization 

Stochastic approximation methods are often introduced to solve the following problem. 
Assume that there is a function L{a \ z,r]) with a being the parameters to be maximized, 
and observed data z, while the rj are not directly observable quantities. Assume that 
L{a I z,T]) is difficult to calculate due to the latency of t], for example, while there is a 
proxy to this function, L*{a \ x,(), which can be calculated easily. Assume that 

L(a I z,?7) = L*(a | z,C) -F e, 

with e small. The proxy L* could be viewed as a noisy version of L. If the task is to hnd 
an extreme value of L with respect to parameter a, one might choose to employ either 
gradient descent methods or Newton-Raphson methods, since in most cases there is no 
analytic solution at hand. For gradient descent methods, we have 

o;h+i) = Q-h) _|_ AjVT(q;‘'*^ I z,r]), 
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with some efficient algorithm to determine the step-width A* for each cycle (t) of the 
iterative algorithm. The iterations implied in this discussion are continued until convergence 
is reached, which is often determined by the sequence not changing by more than a 

constant e. Obviously, this may be the case once A* reaches some value smaller than ce for 
some constant c. 

In order to avoid computationally costly evaluations of L{a \ z,?]), stochastic 
approximation methods instead use the noisy replacement L*{a \ z,Q- Alternatively, a 
noisy replacement of the gradient is used, that is, 

VL*{a I z, 0 = | z, C), • • • , T^L*{a \ z, 0), 

oc^i ocyf{ 

so that the updated rule becomes 

^(t+i) _ «(*) + I 

with some step- width St, for which St = oo and S^ < oo holds. Often St = I /t is used. 

In the case considered here, the function L is the likelihood function used to estimate 
the parameters of a latent regression. A regression is termed latent regression when the 
dependent variable is unobserved and has to be replaced by a conditional expectation for 
each observation v. The proxy of this likelihood function, L* , is an estimator of the same 
regression, for example, where the dependent variable is unobserved but values are plugged 
in from the conditional distribution of the dependent variable rather than from an estimate 
of the conditional expectation. In other words, instead of plugging in an estimate of 
9 = E{9 \ z), we use a draw 9 from the posterior ~ P{9 \ z) as the plug-in for the estimator 
(X^X)-iX0. 

The Metropolis -Hastings (MH-) Algorithm 

The MH-algorithm (Gelman, Carlin, Stern, & Rubin, 1995; Hastings, 1970; Metropolis, 
Rosenbluth, Rosenbluth, Teller, & Teller, 1953) is a method to generate a Markov chain of 
draws from a distribution p{9 \ z) without the need for calculating the density as a whole or 
to approximate it. It is sufficient to be able to calculate values proportional to p{9* \ z) for 
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a sequence of values of 0* drawn from some jump distribution fg new^Qold(^0 ^ 5 ^ }* 

The jump distribution Qnew'^ defines a conditional distribution of the new value 

given the previous value 9°^'^. Then the MH-algorithm operates as follows: 

• First, some initial value 9^ is drawn from g(0, 9^) and the old, or previous, value 9°^^ is 
established. 



• Second, a new proposed value 9'^'^'^ is generated using q{9°^^, 9^^'^). 



• Third, it is determined whether this new 9^’^'^ should be accepted, using the ratio 

_ p{9^^^ I 9°^^) 

^ ~ p{9°^^ I z)q{9°^<^, 9^^^) 

with acceptance rate P{9'^ \ 9^) = min{l,r). 



The calculation of P{9 \ z) often involves expressions of the form 



p{9 I z) 



p(z I 9)P{9) 

I,pO I O)p(e)d0 



SO that the calculation of the ratio r turns out to reduce to a simple expression 



p(z I e^^^)p{9^^^)q{9ne^ ^9old) 
p{z I 9°^d'^p{9°^d'^q{9°’-d ^ 9 new'j ’ 



which is often easy to calculate. 



4 Application to NAEP Latent Regression 

It is proposed to estimate the latent regression, which is the population model 9 = xT + e 
using stochastic approximation by means of an MH sampler, with the commonly employed 
monotone decreasing sequence of weights, or step-widths, 6t = ^. More specihcally, we 
dehne a process that produces draws from the posterior distribution of 9, given observations 
and covariates x^,, using the provisional estimates of and T*'*\ More specihcally, the 
algorithm proposed here will provide draws from the posterior 

~ -P(s, r, <;)(*) (^ I Yv,^v) (2) 

and will base the iterative updating of the estimates and T*^*^ on principles of a 
decreasing sequence of weights 6t as outlined in Gu and Kong (1998) and Cai (2007). 




Metropolis-Hastings Process for Latent Regressions 

Following Gu and Kong (1998) and Cai (2007), we propose to use the MH-algorithm for 
generating draws from the posterior given in (2). A yet to be determined function of these 
draws from the posterior distribution will serve as a noisy estimate of the posterior mean 
required in the EM-algorithm for estimating latent regressions. The posterior probability of 
the d-dimensional ability variable 9 given responses y and covariates x is 



P{9 I y^,,x^) = 



P(y, I e)P{9 I x„) 



jgP{yv I 9)p{9 I X„)d6'’ 

with multivariate normal density P{9 \ x^,) = 0(0 | Xi,F, S), covariance matrix S, and 
conditional mean Xi,F. For the jump distribution in iteration (t), we choose 



0” ~ At(0°,cSW), 



that is q{9°^^ , 9^'^'^) = 0(0'^®"' | 0°*'^,cSW) with some constant c (Gelman et al., 1995). Then, 
the ratio r used to determine the acceptance probability for the next draw in the MH chain 
is given by 



1 


Qnew'^ p ^Qnew 


1 x„) 


P{Yv 


1 Qold'^ p ^Qold 1 


X„) 



since g() = 0() is symmetric. The acceptance probability is dehned as presented in this 
discussion and equals 

p^gneu, | 0°^^) = min{r , 1) . 

This produces a series of draws from P^^\9 \ yt,,Xt,) P(y^ I 0)0 (x„FW,sW). Note 
that there are two terms involved in these calculations that depend on estimates 
obtained in cycle (t). More specihcally, the actual use of q{9°’‘^,9"''^^) in cycle (t) 
requires plugging in the current estimate of the conditional covariance matrix, that is 
q(t) (^Qoid ^ Qnew'^ _ (jy^Qnew | Qoid addition, the conditional distribution depends on the 
estimates obtained in cycle (t) as P^^\9 \ x^,) = (f){9 \ Xt,Fh),Sh)). 
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Updating Regression Parameters Using Stochastic Approxima- 

tion 

The principle behind stochastic approximation methods is a combination of evaluating 
noisy versions of the objective function and dehning a sequence of updates that use 
decreasing weights so that the impact of using noisy functions is averaged out over the 
course of the optimization process. However, the determination of the latent regression 
parameters following the methods put forward by Mislevy et ah (1992) is conducted as a 
one-step estimator, calculated in each M-step of their proposed EM-algorithm. Plugging in 
the noisy estimate 0*^*Mirectly would yield 

This one-step estimator produces a noisy estimate of the current state of affairs in the 
EM-algorithm. Therefore, following the principle put forward by Robbins and Monro 
(1951), the change in estimates due to noisiness needs to be tempered by dehning a 
weighted average of the current and previous states of the iterations. For a regression 
estimator, this can be done by exploiting the fact that the OLS estimator is linear in the 
dependent variable. Dehne the difference 

ef = (0h) 

and calculate the regression parameter based on this vector of differences of the unobserved 
target of inference between the previous cycle {t — 1) and the current cycle (t), 

= (X'^X)-^X^0j^. 

Then, calculate the update using the weight 

fh+i) = fw + 5t+if 

This update of the previous estimate T^*^ by means of the weighted estimator <5t+iT based 
on the change (or difference), 9^^'^ — is equivalent to 

= (1 - 5t+i)f(') + 
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due to linearity of the OLS estimator. The sequence of updates can be started using f = 0, 
since = 1; therefore, (1 — 5i) = 0. This yields a sequence . . . , , 

with each instance of the T*-*^ being determined recursively by the previous estimates 
r(o) ^ ^ ^ p(t-i) updated with a weight of = 1/t to account for the current noisy, but 
improved, approximation of the system. Similarly, dehne 



with 

sh+b = i/(0-xfb+b). 

The main challenge of the proposed method will be evaluating and determining convergence, 
since the decreasing weights St enforce diminishing differences between two consecutive 
cycles {t) and (t + 1). Therefore, the difference between two estimates may not be the most 
suitable choice of a criterion for determining when the cycles can be brought to an end. 
Sinharay and von Davier (2005)and von Davier and Sinharay (2007) suggested using a 
combination of criteria that evaluate the change in the log-likelihood as well as the change 
in parameters for their proposed version of a stochastic EM- algorithm. 

5 Examples from the National Assessment of Educational Progress (NAEP) 
NAEP 2002 Reading Grade 8 Example 

We analyzed data from the 2002 NAEP reading assessment at grade 8 (see, for 
example. National Center for Education Statistics, 2002). Each of the approximately 
115,000 students was asked either two 25-minute blocks of questions or one 50-minute block 
of questions; each block contained at least one passage and a related set of approximately 
10 to 12 comprehension questions (a combination of four-option multiple-choice and 
constructed- response questions). There were a total of 111 items. As many as 566 predictors 
were used in the latent regression model. Three subskills of reading were assessed: (a) 
reading for literary experience, (b) reading for information, and (c) reading to perform 
a task. For three-subscale assessments such as these, CGROUP, the operationally used 
version of MGROUP, ETS’s set of software programs for NAEP analyses, would typically 
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be used. For this study, we developed SGROUP, an implementation of MGROUP using 
MH stochastic approximation methods. Instead of GGROUP, we used BGROUP as the 
basis of our comparison with SGROUP. Sinharay and von Davier (2005) had analyzed these 
data and found their extended form of the BGROUP program gave more accurate results 
than GGROUP. 

On a desktop computer running Linux with a 2.2 GHZ processor and 2 GB RAM, 
SGROUP used 304 iterations of the MH-algorithm and took approximately 31 minutes to 
converge when using a chain length of 200 MH draws. In contrast, the GGROUP program 
would take about 22 minutes to perform the convergence on the same machine. If the 
number of MH draws per iteration were increased, the time needed to converge would be 
increased proportionally. 

Figure 1 provides a comparison of estimated regression coefficients (i.e., estimates 
of components of F) as differences between GGROUP and BGROUP, and also for the 
differences between SGROUP and BGROUP, for each of the three subskills. 

For convenience of viewing, the range of the vertical scale in the plots in Figure 1 is the 
same. The differences between SGROUP and BGROUP are negligible and are as large as 
those between GGROUP and BGROUP. 

Table 1 compares the residual variance estimates S from BGROUP, GGROUP, and 
SGROUP. The differences of results produced by SGROUP and BGROUP are negligible 
and are mostly smaller than the small differences between the results produced by GGROUP 
and BGROUP. In addition, all three of the variance component estimates are slightly higher 
for GGROUP than for BGROUP, and all three correlations are slightly lower for GGROUP 
than for BGROUP. The estimated variances and covariances produced by SGROUP do not 
have such a systematic deviation from the BGROUP estimates. 

Figures 2 and 3 compare the differences between posterior means and standard 
deviations (SDs) of 10,000 randomly chosen examinees for GGROUP versus BGROUP 
estimates, and for SGROUP versus BGROUP estimates. Both hgures have a plot for each 
subscale. For convenience, the range of the vertical scales in the plots are the same for 
GGROUP and SGROUP. 
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Figure 1. Comparison of regression coefficients from CGROUP, BGROUP, and 
SGROUP for the 2002 NAEP reading assessment at grade 8. 

The rightmost panels of Figure 2 show the differences roughly in the scale reported 
by NAEP. In operational NAEP, a weighted average of the scores on the three subscales 
(with weights 0.35, 0.45, and 0.20 for literacy, information, and performance, respectively) 
is reported. NAEP uses a complicated linking procedure involving data for grades 4, 8, and 
12 — this usually converts the composite score to a scale with mean of approximately 300 
and an SD of approximately 35. For the 2002 NAEP reading assessment at grade 8, the 
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Table 1 



Residual Variances, Covariances, and Correlations 
for the 2002 NAEP Reading Assessment at Grade 8 



BGROUP CGROUP-BGROUP SGROUP-BGROUP 





Lit. 


Inf. 


Per. 


Lit. 


Inf. 


Per. 


Lit. 


Inf. 


Per. 


Literary 


.517 


.363 


.339 


.010 


.004 


.004 


.004 


-.002 


.007 


Information 


.765 


.435 


.339 


-.006 


.009 


.005 


-.007 


-.001 


.002 


Performance 


.733 


.800 


.413 


-.007 


-.006 


.010 


.011 


.004 


.002 



Note. Residual variances are shown on main diagonals, covariances on upper off-diagonals, 

and correlations on lower off-diagonals. 



reported mean of the composite was 264 and the SD of the composite was 35. We do not 
use the rigorous NAEP linking procedure here, but instead use a simpler alternative that 
involves computing the weighted average of the posterior means on the three subscales for 
each examinee, using the weights 0.35, 0.45, and 0.20, as used in NAEP. Then we apply a 
linear transformation of the resulting weighted average to convert it to a scale with a mean 
of 264 and an SD of 35. 

For the most part, results produced by SGROUP are close to those produced by 
BGROUP and closer than those produced by GGROUP. GGROUP has a tendency to 
overestimate high posterior means and underestimate low posterior means (the extent of 
underestimation being more severe, especially for a few examinees). GGROUP slightly 
overestimates the extreme posterior SDs, a phenomenon that was also observed by 
Thomas (1993) and von Davier and Sinharay (2007). The SGROUP does not have any of 
these limitations. 

Table 2 shows the subgroup means and SDs (in parentheses) from BGROUP and the 
difference in these values from SGROUP and BGROUP — there seems to be little difference 
between the results. 
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Figure 2. Comparison of the posterior means from CGROUP, BGROUP, and 
SGROUP for the 2002 NAEP reading assessment at grade 8. 

Math Grade 4 NAEP 2000 

The 2000 math NAEP assessment in grade 4 has hve subscales, namely, (a) number and 
operations, (b) measurement, (c) geometry, (d) data analysis, and (e) algebra. The sample 
consists of approximately 13,900 students. The number of items across all subscales and 
booklets was 173, and 387 predictor variables were included in the latent regression model. 
On a desktop computer running Linux with a 2.2 GHZ processor with 2 GB RAM, 
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Figure 3. Comparison of posterior standard deviations from CGROUP, 
BGROUP, and SGROUP for the 2002 NAEP reading assessment at grade 8. 

the program implementing SGROUP used 421 iterations of the MH-algorithm and took 
approximately 1.5 hours to converge. In contrast, it would take the same machine 2 hours 
to converge CGROUP. These values refer to the MH-algorithm with chain length of 200. 
Longer chains will result in longer run times. The SGROUP program allows one to vary 
the chain length and to increase it programatically after initial iterations with chain lengths 
as little as 100 in the early cycles, and up to 2,000 in the final cycles. Increasing the chain 
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Table 2 



Comparison of Subgroup Estimates From BGROUP and SGROUP for the 
2002 NAEP Reading Assessment at Grade 8 



Subgroup 




BGROUP 




SGROUP-BGROUP 


Literary 


Information 


Task 


Literary 


Information 


Task 


Overall 


0.027 


0.022 


0.022 


-0.001 


0.000 


0.000 




(0.984) 


(0.949) 


(0.970) 


(0.003) 


(-0.001) 


(0.001) 


Male 


-0.116 


-0.078 


-0.133 


-0.001 


0.000 


0.000 




(0.983) 


(0.962) 


(0.967) 


(0.002) 


(-0.001) 


(0.001) 


Female 


0.170 


0.123 


0.178 


0.000 


0.000 


0.000 




(0.965) 


(0.925) 


(0.947) 


(0.002) 


(-0.001) 


(0.002) 


White 


0.286 


0.267 


0.315 


-0.001 


0.000 


0.001 




(0.897) 


(0.852) 


(0.849) 


(0.002) 


(-0.001) 


(0.002) 


Black 


-0.497 


-0.447 


-0.516 


-0.002 


-0.001 


-0.001 




(0.935) 


(0.897) 


(0.903) 


(0.004) 


(0.000) 


(0.001) 


Hispanic 


-0.408 


-0.434 


-0.516 


-0.002 


-0.000 


-0.000 




(0.982) 


(0.981) 


(0.982) 


(0.003) 


(-0.001) 


(0.002) 


Asian 


0.137 


0.200 


0.124 


0.000 


0.001 


-0.001 


American 


(0.959) 


(0.944) 


(0.940) 


(0.003) 


(-0.002) 


(0.001) 


American 


-0.330 


-0.299 


-0.250 


0.000 


-0.001 


-0.001 


Indian 


(0.946) 


(0.922) 


(0.953) 


(0.002) 


(-0.001) 


(-0.003) 



length will increase estimation time and the accuracy of the estimates. Note that BGROUP 
is not applicable to this example, since the number of subscales in the mathematics example 
does not allow the evaluation of a hxed-grid numerical integration over hve dimensions in 
limited time. Trials of BGROUP with four dimensions on a comparable personal computer 
took more than a week to achieve convergence. 

The results obtained indicate that both programs £11 in numeric boundaries for the 
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residual correlation between the five subscales, which are known to be highly correlated, as 
all five math scales measure closely related quantitative skills in different subdomains of 
mathematics as taught in school. Therefore, the results have to be interpreted with some 
care. However, parameter estimates obtained by CGROUP and SGROUP are very similar; 
the correlations between regression coefficients obtained with GGROUP and SGROUP 
range between 0.9997 and 0.9914. 

Table 3 presents correlations between these estimates, and means and standard 
deviation of the estimates obtained by GGROUP and SGROUP. 

Table 3 

Comparison of Means and Standard Deviations From CGROUP and 



SGROUP for the 2002 NAEP Reading Assessment at Grade 8 



Scale 


1 


2 


3 


4 


5 


correlation 


0.99970 


0.99909 


0.99142 


0.99844 


0.99779 


Mean GGROUP 


-0.00059 


0.00069 


-0.00055 


0.00021 


-0.00090 


Mean SGROUP 


-0.00060 


0.00066 


-0.00053 


0.00012 


-0.00088 


SD GGROUP 


0.02143 


0.02274 


0.02199 


0.02425 


0.02385 


SD SGROUP 


0.02128 


0.02242 


0.02181 


0.02379 


0.02325 



A scatterplot depicting the agreement between SGROUP and GGROUP with respect 
to the regression parameters obtained is presented in Figure 4. In this plot, only two of the 
five subscales are compared: Scale 1, with the highest correlation between GGROUP and 
SGROUP, and Scale 3, with the relatively lowest correlation of 0.991. 

As the direct comparison in the top row and the difference plots in the bottom row 
indicate, the relationship between the estimates obtained by GGROUP and SGROUP is 
closely matched by the identity line {X = Y). However, since the high correlations between 
subscales prevented both algorithms from converging without setting numerical bounds to 
residual correlations, further in-depth analyses of similarities and differences between the 
two approaches will not be carried out here. We note that in spite of the limits set on 
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Figure 4- Comparison of regression coefficients obtained for the NAEP math 
2000 grade 4 data using the operational CGROUP and the stochastic approxi- 
mation estimation method (SAEM) implemented in SGROUP. 

Note. Due to very high correlations for all subscales, only a subset of two subscales repre- 
senting the range of results is presented. 

correlations, both algorithms provided very similar estimates for the regression coeffficients 
in this five-dimensional case. 
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6 Conclusions 



This paper describes how a stochastic approximation method can be used in estimating 
latent regression models used in large-scale educational surveys such as NAEP. The 
examples presented in this paper compare the stochastic approximation approach to the 
operational numerical methodologies used in NAEP and other large-scale educational 
survey assessments. The results indicate that stochastic approximation can be used to 
avoid some of the technical assumptions underlying current operational methods. Among 
these are the assumptions used in the Laplace approximation in the operational CGROUP 
approach. Moreover, stochastic approximation methods rely on the draws from the 
posterior distribution of the statistic of interest. This allows one to generate imputations 
(plausible values) without the assumption of posterior normality. This is an advantage 
of the stochastic approximation methods, which may play out favorably in cases where, 
for example, due to reduced dimensionality of the background model or due to a limited 
number of items in the measurement model, the normality assumption used in the current 
operation approach may not be appropriate. The approach implemented in S GROUP 
enables one to draw plausible values without relying on a restrictive assumption about the 
form of the posterior distribution. Note, however, the convergence behavior of stochastic 
methods needs close monitoring. For this purpose, there is need for further investigations 
aimed at optimizing the choice of the annealing process in order to minimize the number of 
iterations needed while at the same time ensuring that the process obtains estimates that 
maximize the objective function. 
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